Abstract

The basic idea of device-to-device (D2D) communication is that pairs of suitably selected wireless 
devices reuse the cellular spectrum to establish direct communication links, provided that the adverse 
effects of D2D communication on cellular users are minimized and cellular users are given a higher
priority in using limited wireless resources. Despite its great potential in terms of coverage and capacity 
performance, implementing this new concept poses some challenges, in particular with respect to radio 
resource management. The main challenges arise from a strong need for distributed D2D solutions that 
operate in the absence of precise channel and network knowledge. In order to address this challenge, this 
paper studies a resource allocation problem in a single-cell wireless network with multiple D2D users 
sharing the available radio frequency channels with cellular users. We consider a realistic scenario where 
the base station (BS) is provided with strictly limited channel knowledge while D2D and cellular users 
have no information. We prove a lower-bound for the cellular aggregate utility in the downlink with 
fixed BS power, which allows for decoupling the channel allocation and D2D power control problems. 

An efficient graph-theoretical approach is proposed to perform the channel allocation, which offers 
flexibility with respect to allocation criterion (aggregate utility maximization, fairness, quality of service 
guarantee). We model the power control problem as a multi-agent learning game. We show that the 
game is an exact potential game with noisy rewards, defined on a discrete strategy set, and characterize 
the set of Nash equilibria. Q-learning better-reply dynamics is then used to achieve equilibrium. 

Index Terms 
Channel allocation, game theory, graph theory, power control, Q-learning, underlay device-to-device
communication. 


I. Introduction

A. Related Works 

Device-to-device (D2D) communication as an underlay to cellular networks is regarded as one
of the key technologies for enhancing the performance of future cellular networks [1]. The basic
idea is to reuse cellular spectrum resources by allowing nearby wireless devices to establish
direct communication links. This concept not only improves the efficiency of spectrum usage
[2], but also has great potential for enhancing the network performance expressed in terms of
capacity, coverage, energy efficiency and end-to-end delay [3]. In order to realize network-controlled
D2D communication as an underlay to cellular networks, a system designer faces
some challenges, which mainly arise due to the lack of reliable channel state information (CSI)
at base stations (BSs). In particular, efficient feedback is the key to obtaining CSI; nonetheless,
while CSI for cellular users^1 can be efficiently acquired at a serving BS, such information is
in general not available for D2D channels. The reason is the separation of the user/data plane
from the control plane in the case of network-controlled D2D communication. An immediate
consequence of this separation is that, in contrast to cellular users, D2D users cannot directly
utilize pilot signals broadcast by BSs for estimation of D2D channels. In addition, local
transmissions of distinct pilot signals by each D2D user are infeasible and would not solve the
problem due to pilot contamination.^2 Since strategies for suppressing pilot contamination in D2D
scenarios suffer from the need for increased feedback and control overhead, it is reasonable to
assume that allocation of resources to D2D users has to be performed in a distributed manner
under strictly limited CSI. Moreover, it is of utmost importance that direct transmissions among
devices are coordinated to ensure that they do not have a detrimental impact on the performance
of cellular users. Such coordination must involve a careful power-controlled allocation of D2D
users to available radio frequency channels, primarily used by a BS (downlink frequencies)
and/or cellular users (uplink frequencies). This problem, which is difficult to solve even in a
centralized manner, is further aggravated in the D2D setting by the need for distributed solutions.

^1 In this paper, D2D user/link is used to refer to any pair of wireless devices that communicate directly, while any wireless
device that operates in the traditional cellular mode is called a cellular user.

^2 Pilot contamination refers to a situation in which the use of a large number of pilot signals leads to a relatively strong
interference that may deteriorate the quality of channel estimation.

To date, numerous resource allocation schemes have been developed for underlay D2D communication
systems. Many of them, however, are only applicable to networks with a limited number
of D2D and/or cellular users. For instance, Reference [4] studies the optimal channel allocation
and power control where one cellular and two D2D users share wireless resources. Similarly, in
[5], the system model includes one cellular and two D2D users, and a game-theoretical approach
(reverse auction) is proposed to solve the resource sharing problem. References [6] and [7] study
a system with multiple D2D users; however, in every time slot, only one D2D user is allowed
to transmit in a channel that is primarily allocated to a cellular user. Similar examples include
[?] and [10], among many others.

Moreover, many works propose centralized resource allocation schemes for hybrid D2D and
cellular communication. The schemes are mainly developed under the assumption that a central
controller has access to the global channel and network knowledge, and therefore is capable
of making coordination and resource allocation decisions not only for cellular users, but also
for D2D users. For instance, Reference [11] formulates the joint channel allocation and power
control problem as a mixed integer program, which is solved using the column generation
method. Similarly, in [12], an energy-efficient uplink resource allocation scheme is proposed
and analyzed by using mixed integer programming. The authors of [13] assume that a BS
is able to perfectly coordinate the interference among cellular and D2D users. As another
example, Reference [14] formulates a joint density and power allocation problem as a non-convex
optimization problem using stochastic geometry, and proposes an algorithm to solve the
problem. A joint resource allocation and mode selection mechanism based on particle swarm
optimization is developed in [15]. See also [16], [17] and [18] for further examples.

In addition, in many research studies, some prior knowledge (such as information about utility
functions) is assumed to be known to D2D users. In most cases, the problem is then solved using
game-theoretical approaches such as pricing [19], [20], auctions [21] or coalition formation [22],
[23], [24], [25]. Moreover, Reference [26] proposes a resource allocation mechanism based on
contract design. Besides requiring prior knowledge at the node level, most game-theoretical
solutions impose large overhead due to the need for heavy information exchange in terms of
bids, or prices and demands.
B. Our Contribution

The system model considered in this paper generalizes existing works in the following
important directions:

• There is no limit on the number of cellular and D2D users that coexist in the network. 

• Multiple D2D users might be allowed to share a given channel with a cellular user. 

• The BS is only aware of statistical channel knowledge of cellular users and geographical 
locations of D2D users. This information can be simply acquired by using pilot signals for 
cellular users and GPS (Global Positioning System) data of D2D users. This means that 
implementing D2D transmissions does not impose any overhead.

• D2D and cellular users do not have any channel knowledge. 

We first prove a lower-bound on the aggregate utility of cellular users. Based on this lower- 
bound, while taking the higher priority of cellular users into account, we decompose the resource 
allocation problem into two cascaded problems related to channel allocation and D2D power 
control. The former problem, which deals with maximizing the utility sum of cellular users, is 
a multi-objective combinatorial optimization problem that is very costly to solve with respect 
to the time and computational complexity. Therefore we propose a suboptimal, but efficient,
graph-theoretical heuristic solution that involves maximum-weighted bipartite matching [27]
and minimum-weighted graph partitioning [29], [30]. The problem can then be solved in a
centralized manner by the BS, since the solution relies only on strictly limited information. The
approach also offers high flexibility in terms of performance criteria, since quality of service or
fairness can also be taken into account. The latter problem, in turn, deals with maximizing the
aggregate utility of D2D users by means of power control, desirably in a distributed manner. We
model the power control problem as a game with incomplete information, which, in contrast to
most previous studies, is defined on a discrete strategy set. We show that this game is an exact
potential game [31] and characterize the set of Nash equilibria. Furthermore, we use Q-learning
better-reply dynamics [32] in order to converge to a Nash equilibrium. Finally, extensive numerical
analysis is performed to evaluate the performance of the proposed approach in practical cases. 


C. Organization 

The paper is organized as follows. In Section II, we introduce the network model and formulate
the resource allocation problem. Section III is devoted to the first stage of the formulated problem,
i.e., centralized channel allocation. Section IV deals with the second stage of the problem, i.e.,
distributed power control. Section V presents numerical evaluations, while Section VI concludes
the paper.

D. Notation 

Throughout the paper we denote a set and its cardinality by a unique letter, and distinguish
them by using calligraphic and italic fonts, such as $\mathcal{A}$ and $A$, respectively. Matrices are shown
by bold upper case letters, for instance $\mathbf{A}$. Moreover, $\mathbf{A}_l$ denotes the $l$-th column of matrix $\mathbf{A}$.
Vectors are shown by bold lower case letters, for example $\mathbf{a}$.

II. System Model and Problem Formulation

A. System Model 

1) Network Model: We consider the downlink of a single-cell network with one BS denoted
by $b$, and a set $\mathcal{C}$ consisting of $L$ single-antenna cellular users, each denoted by $l$. The cell
is provided with a set $\mathcal{Q}$ of $Q = L$ orthogonal frequency channels that are referred to by $q$.
Throughout the paper, by the term D2D user we refer to a pre-defined pair of one single-antenna
transmitter and one single-antenna receiver, which is represented either by $k$ or by the
pair $(k, k')$. Note that a single device can be either transmitter or receiver. We use $\mathcal{K}$ to denote the
set of $K$ D2D users. The BS is able to communicate with multiple cellular users simultaneously,
possibly by means of multiple antennas. The data stream intended for any given cellular user
is transmitted with fixed average power $p_c$. Each D2D user selects a power level from the set
$\mathcal{M} = \{p_d^{(1)}, p_d^{(2)}, \dots, p_d^{(M)}\}$, where $1 < p_d^{(1)} < p_d^{(2)} < \cdots < p_d^{(M)}$. We assume that $p_d^{(M)} < p_c$,
since in general the BS has access to larger energy resources in comparison with user devices.
Each downlink frequency channel $q$ is used i) by the BS in order to transmit to some set $\mathcal{C}_q \subseteq \mathcal{C}$
of $L_q$ cellular users, and ii) by a set $\mathcal{K}_q \subseteq \mathcal{K}$ of $K_q$ D2D users for direct communication. We
assume that $L_q = 1$ $\forall q \in \mathcal{Q}$; that is, each channel is assigned to exactly one cellular user and
therefore no vacant channel exists. This assumption is made in order to protect cellular users
from an excessive interference due to a high BS power. We use $\mathbf{p}_{d,q} = (p_1, \dots, p_{K_q})$ to denote
the vector of transmit powers of the D2D users that transmit through channel $q$. Throughout
the paper, $h_{uv,q} > 0$ is the average gain of channel $q$ from transmitter $u$ to receiver $v$. We
assume that $h_{uv,q} = f_{uv,q}\, g_{uv}$, where $0 < f_{uv,q} < 1$ and $0 < g_{uv} < 1$ stand for fast fading and
path loss, respectively. We assume that the channel gains of any given link are drawn from
a stationary distribution. Moreover, due to channel reciprocity, we have $h_{uv,q} = h_{vu,q}$. Signal-to-interference
ratio (SIR) is denoted by $\gamma$. We consider a high SIR regime where $\gamma \gg 1$, so
that $\log(1 + \gamma) \approx \log(\gamma)$. When treating interference as noise, $\log(\gamma)$ represents the achievable
transmission rate of interference-limited point-to-point transmission.

2) Utility Model: The utility of cellular user $l \in \mathcal{C}_q$ that occupies channel $q$ is defined as^3

$$
R_l(q, \mathbf{p}_{d,q}) = \log\left(\frac{p_c h_{bl,q}}{1 + \sum_{k \in \mathcal{K}_q} p_k h_{kl,q}}\right), \tag{1}
$$

which corresponds to the achievable transmission rate, as described before. Note that this utility
model is widely used in the literature; see for example [33].

Since D2D users are subject to power control in addition to channel allocation, the utility of
any D2D user $k \in \mathcal{K}_q$ is defined to be

$$
R_k(q, \mathbf{p}_{d,q}) = \log\left(\frac{p_k h_{kk',q}}{1 + \sum_{j \in \mathcal{K}_q,\, j \neq k} p_j h_{jk',q} + p_c h_{bk',q}}\right) - c\, p_k, \tag{2}
$$

where $c$ is a fixed price factor to penalize excessive power usage [34]. By definition, the utility
of a D2D user corresponds to its transmission rate (see above) minus a cost that is paid to the
cellular user in order to reimburse the adverse effects of spectrum sharing. The price factor can
be either equal for all D2D users (as in [?]) or selected proportional to the channel gain (or
distance) between a D2D user and the cellular user transmitting in the same channel [35]. Our
analysis holds for both cases.
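The two utility definitions can be evaluated numerically with a few lines of code. The following is a minimal sketch, assuming one cellular user per channel; the function names, argument layout and all numerical values are hypothetical and serve only to illustrate (1) and (2).

```python
import math

def cellular_utility(p_c, h_bl, d2d_powers, h_kl):
    """Utility (1): log-SIR of the cellular user on one channel.

    d2d_powers[k] -- transmit power of D2D user k sharing the channel
    h_kl[k]       -- gain from D2D transmitter k to the cellular user
    """
    interference = sum(p * h for p, h in zip(d2d_powers, h_kl))
    return math.log(p_c * h_bl / (1.0 + interference))

def d2d_utility(k, d2d_powers, h_direct, h_cross, p_c, h_bk, c):
    """Utility (2) of D2D user k: log-SIR minus the power price c * p_k.

    h_direct[k]   -- gain from transmitter k to its own receiver k'
    h_cross[j][k] -- gain from transmitter j to receiver k'
    h_bk[k]       -- gain from the BS to receiver k'
    """
    interference = p_c * h_bk[k] + sum(
        d2d_powers[j] * h_cross[j][k]
        for j in range(len(d2d_powers)) if j != k
    )
    sir = d2d_powers[k] * h_direct[k] / (1.0 + interference)
    return math.log(sir) - c * d2d_powers[k]

# Hypothetical toy values: two D2D users share the channel of one cellular user.
powers = [2.0, 3.0]
r_cell = cellular_utility(100.0, 0.5, powers, [0.1, 0.2])
r_d2d = d2d_utility(0, powers, [0.8, 0.7], [[0.0, 0.05], [0.1, 0.0]],
                    100.0, [0.001, 0.002], 0.1)
```

Passing a per-user price instead of the scalar `c` would cover the distance-dependent pricing variant mentioned above.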

3) Information Model: We consider a model with strictly limited information, as described 
in the following assumption. 


Assumption A1. Each of the following is assumed throughout the paper.

a) The BS has knowledge of i) geographical locations of cellular and D2D users and the path
loss exponent, thereby $g_{kl}$ $\forall\, l \in \mathcal{C}, k \in \mathcal{K}$, and ii) the average fading gain of cellular to BS
links, i.e., $h_{bl,q}$ $\forall\, l \in \mathcal{C}, q \in \mathcal{Q}$.

b) The BS has no information about the fast fading component of cellular to cellular or D2D
to D2D links.

c) Cellular and D2D users have no channel knowledge.

^3 Throughout the paper, all logarithms are natural.


B. Problem Formulation 

Network aggregate utility is conventionally regarded as a measure for evaluating the performance
of resource management protocols [36], [37], [38], [39]. Based on this criterion, the
problem is to allocate channels and power levels to cellular and D2D users so as to maximize
the network aggregate utility. With (1) and (2) in hand, this problem can be stated formally as

$$
\text{maximize} \quad \sum_{q=1}^{Q} \left( \sum_{l \in \mathcal{C}_q} R_l(q, \mathbf{p}_{d,q}) + \sum_{k \in \mathcal{K}_q} R_k(q, \mathbf{p}_{d,q}) \right), \tag{3}
$$

where $\mathcal{C}_q \subseteq \mathcal{C}$, $\mathcal{K}_q \subseteq \mathcal{K}$, $\mathbf{p}_{d,q} \in \bigotimes_{k=1}^{K_q} \mathcal{M}$, and $\bigotimes$ denotes the Cartesian product.


Note that unlike some previous works such as [40] and [41], the utility functions defined here
are user-specific, i.e., the reward of any given channel differs across users. As a result, the
set of D2D and cellular users allocated to each channel is required to be determined, and not
just the number of users.

Such a formulation, however, does not comply with the underlay D2D concept, and suffers from
the following drawbacks that make it difficult or even impossible to deal with: i) the objective
function in (3) is not available at the BS due to the lack of information (see Assumption A1),
ii) the higher priority of cellular users is not taken into account, and iii) the objective function
depends on both channel and power allocations, which are mutually dependent. Therefore a solution
to (3) is difficult to obtain and is not expected to be amenable to distributed implementation.
Our goal is therefore to develop a sophisticated heuristic approach. To this end, we first prove
a lower-bound on the aggregate utility of cellular users that enables us to decouple the channel
allocation and power control problems.


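Proposition 1. For any $\mathbf{p}_{d,q}$, $p_c$ and channel gains, we have

$$
\sum_{q=1}^{Q} \sum_{l \in \mathcal{C}_q} R_l(q, \mathbf{p}_{d,q}) \;\geq\; \sum_{q=1}^{Q} \sum_{l \in \mathcal{C}_q} \log(p_c h_{bl,q}) - \sum_{q=1}^{Q} \sum_{l \in \mathcal{C}_q} \sum_{k \in \mathcal{K}_q} p_d^{(M)} g_{kl}. \tag{4}
$$

Proof: See Appendix VII-A.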


In words, the lower-bound in (4) corresponds to the worst-case scenario, in which all D2D
users transmit at the maximum available power and the fast fading component of all D2D to
cellular links equals one, thereby causing the maximum interference. Thus, for any realization of
channel gains, the accuracy of the bound depends strongly on the range of the set of power levels
$\mathcal{M}$, i.e., $p_d^{(M)} - p_d^{(1)}$. Apart from this, as the bound does not depend on D2D power allocation and
relies on the available information at the BS, it can serve as a basis for resource management.
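The bound can be sanity-checked numerically for a single channel with one cellular user. The sketch below is only an illustration under assumed values: the power set, the gain ranges and the function name `exact_and_bound` are hypothetical, and the check relies on $\log(1+x) \leq x$ together with $f_{kl,q} < 1$ and $p_k \leq p_d^{(M)}$.

```python
import math
import random

def exact_and_bound(p_c, h_bl, powers, fading, path_gain, p_max):
    """Return (utility (1), single-user term of the lower bound (4))."""
    # Actual interference uses h_kl = f_kl * g_kl and the chosen powers.
    interference = sum(p * f * g for p, f, g in zip(powers, fading, path_gain))
    exact = math.log(p_c * h_bl / (1.0 + interference))
    # Worst case: maximum power and fast fading equal to one.
    bound = math.log(p_c * h_bl) - sum(p_max * g for g in path_gain)
    return exact, bound

rng = random.Random(0)
levels = [2.0, 4.0, 6.0]  # hypothetical D2D power set, p_max = 6
results = []
for _ in range(1000):
    k = rng.randint(1, 4)
    powers = [rng.choice(levels) for _ in range(k)]
    fading = [rng.uniform(0.01, 0.99) for _ in range(k)]
    gains = [rng.uniform(0.01, 0.99) for _ in range(k)]
    h_bl = rng.uniform(0.1, 0.9)
    results.append(exact_and_bound(100.0, h_bl, powers, fading, gains, max(levels)))

# Count draws where the exact utility falls below the bound (none are expected).
violations = sum(1 for exact, bound in results if exact < bound - 1e-12)
```

The gap between the two values grows with the spread of the power set, in line with the remark on the accuracy of the bound.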

Since cellular users are assumed to have a higher priority and should be served first, we
propose a two-step resource allocation strategy. In the first step, the objective is to maximize the
lower-bound in (4) on the aggregate utility of cellular users. More precisely, given $p_c$ and
imperfect channel knowledge, we aim at assigning channels to cellular and D2D users so as to

$$
\text{maximize} \quad \sum_{q=1}^{Q} \sum_{l \in \mathcal{C}_q} \log(p_c h_{bl,q}) - \sum_{q=1}^{Q} \sum_{l \in \mathcal{C}_q} \sum_{k \in \mathcal{K}_q} p_d^{(M)} g_{kl} \tag{5}
$$

subject to

$$
L_q = 1, \quad \forall q \in \mathcal{Q}. \tag{6}
$$

This problem is investigated in Section III.


Once channels are allocated, in the second step we address the power control problem for
D2D users, with the goal of maximizing the aggregate utility of D2D users as formalized below.

$$
\underset{\mathbf{p}_d}{\text{maximize}} \quad \sum_{q=1}^{Q} \sum_{k \in \mathcal{K}_q} R_k(q, \mathbf{p}_{d,q}). \tag{7}
$$

Section IV is devoted to this problem.

Summarizing, the resource allocation problem is decomposed into a channel allocation problem
for all users followed by a power control problem for D2D users. As we see later, while the
first problem is solved by the BS using a centralized method, the second problem is solved by
D2D users in a distributed manner. Using such a two-stage scheme, not only is the higher priority of
cellular users taken into account, but D2D users also utilize the assigned channels efficiently.
Moreover, the limited available information is exploited with low computational effort.


III. Channel Allocation 

This section deals with the first step of resource management, i.e., channel assignment with
the goal of optimizing the performance of cellular users in terms of (5).

A. The Channel Allocation Scheme


We notice that the first and second terms in (5) are proportional to the sum of the desired
signals and interferences over all cellular users, respectively. Moreover, while the first term
depends only on cellular users, the second term depends on D2D users as well. Roughly speaking,
the problem in (5) can be rephrased as maximize $f(x) - g(x, y)$, where $x$ and $y$ respectively
denote the cellular and D2D channel assignments. This problem is a multi-objective combinatorial
optimization problem that is NP-hard and hence notoriously difficult to solve. Therefore we
propose the following suboptimal, but simple and efficient, heuristic approach: At the beginning,
we maximize the first term (weighted signal sum) so that the sets $\mathcal{C}_q$, $q \in \mathcal{Q}$, are defined.
Afterwards, given $\mathcal{C}_q$, we allocate D2D users to frequency channels in a way that the second
term (interference sum) is minimized. Formally,

$$
\text{maximize} \quad \sum_{q=1}^{Q} \sum_{l \in \mathcal{C}_q} \log(p_c h_{bl,q}) \tag{8}
$$

subject to (6), and

$$
\text{minimize} \quad \sum_{q=1}^{Q} \sum_{l \in \mathcal{C}_q} \sum_{k \in \mathcal{K}_q} p_d^{(M)} g_{kl}. \tag{9}
$$

We call (8) and (9) the assignment and clustering problems, respectively. In the next two
subsections, we show that these problems boil down to two classic graph-theoretical problems
on the induced network graph, namely maximum-weighted bipartite matching and minimum-weighted
partitioning.

1) Assignment Problem: In the following, we show that problem (8) can be formulated as a
weighted bipartite matching, defined below.

Definition 1 (Weighted Bipartite Matching). Let $G = (\mathcal{V}, \mathcal{E})$ be a weighted bipartite graph
where $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2$, $\mathcal{V}_1 \cap \mathcal{V}_2 = \emptyset$ and $\mathcal{E} \subseteq \mathcal{V}_1 \times \mathcal{V}_2$. Each edge $e \in \mathcal{E}$ connecting any two
vertices $x \in \mathcal{V}_1$ and $y \in \mathcal{V}_2$ is associated with some weight $w_{xy}$. The weights are gathered in
the $V_1 \times V_2$ graph matrix denoted by $\mathbf{W} = [w_{xy}]$.

Matching: A matching is a subset $\mathcal{M} \subseteq \mathcal{E}$ such that for every $v \in \mathcal{V}$ at most one edge in $\mathcal{M}$ is incident
upon $v$.

Maximum Matching: A matching $\mathcal{M}$ such that every other matching $\mathcal{M}'$ satisfies $W_{\mathcal{M}'} \leq W_{\mathcal{M}}$,
where $W_{\mathcal{M}}$ denotes the total weight of the selected edges for some matching $\mathcal{M}$.

Minimum Matching: A matching $\mathcal{M}$ such that every other matching $\mathcal{M}'$ satisfies $W_{\mathcal{M}} \leq W_{\mathcal{M}'}$.


Based on Definition 1, consider a bipartite graph $G_L(\mathcal{V}, \mathcal{E})$, with $\mathcal{V}_1 = \mathcal{C}$ (the set of cellular
users) and $\mathcal{V}_2 = \mathcal{Q}$ (the set of channels). The weight of the edge connecting $l \in \mathcal{C}$ and $q \in \mathcal{Q}$,
$w_{lq}$, is defined as the weighted average gain of channel $q$ between the cellular user $l$ and the BS,
i.e., $\log(p_c h_{bl,q})$. The problem is then to assign each cellular user a channel so that (6) and (8)
are satisfied. Let the assignment be represented by an $L \times Q$ assignment matrix $\mathbf{A} = [a_{lq}]$, where

$$
a_{lq} = \begin{cases} 1 & \text{if } l \in \mathcal{C}_q, \\ 0 & \text{otherwise.} \end{cases} \tag{10}
$$

Therefore $\mathbf{A}$ satisfies the following constraints:

$$
\sum_{l=1}^{L} a_{lq} \leq 1, \quad q \in \{1, 2, \dots, Q\}, \tag{11}
$$

$$
\sum_{q=1}^{Q} a_{lq} = 1, \quad l \in \{1, 2, \dots, L\}, \tag{12}
$$

$$
a_{lq} \in \{0, 1\}, \quad \forall\, l, q. \tag{13}
$$

While (11) implies that each channel serves at most one cellular user, (12) means that each
cellular user is served by exactly one channel. Note that equality holds in (11) as we assume
$Q = L$ (see Section II-A1). The sum of edges' weights yields

$$
\sum_{q=1}^{Q} \sum_{l \in \mathcal{C}} a_{lq} w_{lq} = \sum_{q=1}^{Q} \sum_{l \in \mathcal{C}_q} w_{lq}. \tag{14}
$$

Thus, the problem in (8) subject to (6) is equivalent to maximizing (14), subject to (11), (12),
and (13), i.e., it corresponds to the maximum matching of $G_L$.
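For small instances, the assignment problem can be illustrated by exhaustive search over all one-to-one assignments. The following is a sketch under assumed channel gains, not the Hungarian algorithm discussed later; the function name `assign_channels` and the matrix `h` are hypothetical.

```python
import math
from itertools import permutations

def assign_channels(weights):
    """Brute-force maximum-weight perfect matching on a square L x Q matrix.

    weights[l][q] plays the role of w_lq = log(p_c * h_bl,q).
    Returns (total weight, assignment) with assignment[l] = channel of user l.
    """
    n = len(weights)
    best_total, best_perm = -math.inf, None
    for perm in permutations(range(n)):
        total = sum(weights[l][perm[l]] for l in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)

p_c = 100.0
# Hypothetical average gains h[l][q] between the BS and user l on channel q.
h = [[0.9, 0.2, 0.4],
     [0.3, 0.8, 0.5],
     [0.6, 0.1, 0.7]]
weights = [[math.log(p_c * gain) for gain in row] for row in h]
total, assignment = assign_channels(weights)
```

The search visits all $L!$ permutations, so it is only meant to make constraints (11)-(13) concrete; a polynomial-time matching algorithm is needed for realistic sizes.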

2) Clustering Problem: This step consists of allocating channels to D2D users with the goal 
of minimizing the total interference to the cellular users over all channels. In order to address 
this problem we need to define the network graph. 


Definition 2 (Network Graph). The network graph for any channel $q \in \mathcal{Q}$ is an undirected
graph $G_q = (\mathcal{V}, \mathcal{E})$ with $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2$, where $\mathcal{V}_1$ and $\mathcal{V}_2$ represent the set of $K$ D2D transmitters
and $L$ cellular receivers, respectively. The weight of an edge between any pair of graph vertices
$(x, y)$ is denoted by $w_{xy}$, where $w_{xy}$ is equal to the average gain of channel $q$ between $x$ and $y$.

However, by Assumption A1 only limited CSI is available at the BS; therefore the network
graph cannot be constructed. As a result, we define the estimated network graph, which can be
reproduced by the BS using the available information.

Definition 3 (Estimated Network Graph). The estimated network graph is an undirected graph $G_E =
(\mathcal{V}, \mathcal{E})$ with $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2$, where $\mathcal{V}_1$ and $\mathcal{V}_2$ represent the set of $K$ D2D transmitters and $L$
cellular receivers, respectively. The weight of an edge between any D2D transmitter $k$ and
cellular receiver $l$ is defined as $w_{kl} = p_d^{(M)} g_{kl}$. The weights of the edges between any two cellular
users and between any two D2D users are respectively equal to some constant $C > K p_d^{(M)}$ and zero.^4

Next we show that problem (9) can be rephrased as Q-way minimum-weighted graph partitioning
on the estimated network graph $G_E$.

Definition 4 (Q-way Weighted Partitioning). Let $G = (\mathcal{V}, \mathcal{E})$ be a weighted graph where each
edge $e \in \mathcal{E}$ connecting any two vertices $x$ and $y$ is associated with some weight $w_{xy}$. The
weights are gathered in a $V \times V$ matrix denoted by $\mathbf{W} = [w_{xy}]$. The minimum-weighted Q-way
partitioning problem divides the set of vertices into $Q$ disjoint subsets in a way that the sum of the
weights of edges whose incident vertices fall into the same subset is minimized.

Now consider the estimated network graph, $G_E$. Then solving (9) is equivalent to finding
some $(L + K) \times Q$ assignment matrix $\mathbf{B} = [b_{jq}]$ that is defined to be

$$
b_{jq} = \begin{cases} 1 & \text{if } j \in \mathcal{C}_q \cup \mathcal{K}_q, \\ 0 & \text{otherwise.} \end{cases} \tag{15}
$$

Thus each column in $\mathbf{B}$, e.g., $\mathbf{B}_q = [b_{1q}, b_{2q}, \dots, b_{(L+K)q}]^{\mathsf{T}}$, $q \in \{1, 2, \dots, Q\}$, is an indicator
describing cluster $q$. Therefore $b_{jq}$ satisfies the following constraints:

$$
\sum_{j=1}^{L+K} b_{jq} = L_q + K_q, \quad q \in \{1, 2, \dots, Q\}, \tag{16}
$$

$$
\sum_{q=1}^{Q} b_{jq} = 1, \quad j \in \{1, 2, \dots, L + K\}, \tag{17}
$$

and

$$
b_{jq} \in \{0, 1\}, \quad \forall\, j, q. \tag{18}
$$

^4 Later we see that this definition results in some form of clustering by which the cellular to cellular and also the D2D to
cellular interferences decrease. D2D to D2D interference is however neglected. This implies that in the absence of full and
precise channel knowledge the priority is to protect cellular users.


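The sum of edges' weights connecting users in cluster $q$ hence follows as

$$
\frac{1}{2} \sum_{j \in \mathcal{C} \cup \mathcal{K}} \sum_{j' \in \mathcal{C} \cup \mathcal{K}} w_{jj'} b_{jq} b_{j'q} = \frac{1}{2} \mathbf{B}_q^{\mathsf{T}} \mathbf{W}_E \mathbf{B}_q, \tag{19}
$$

where $\mathbf{W}_E$ is the weight matrix of $G_E$. As a result, the total sum-weight of edges that are not
cut by the Q-way partitioning of $G_E$ yields

$$
\begin{aligned}
\frac{1}{2} \sum_{q=1}^{Q} \sum_{j \in \mathcal{C} \cup \mathcal{K}} \sum_{j' \in \mathcal{C} \cup \mathcal{K}} w_{jj'} b_{jq} b_{j'q} = {} & \frac{1}{2} \sum_{q=1}^{Q} \sum_{j \in \mathcal{K}} \sum_{j' \in \mathcal{K}} w_{jj'} b_{jq} b_{j'q} + \frac{1}{2} \sum_{q=1}^{Q} \sum_{j \in \mathcal{C}} \sum_{j' \in \mathcal{C}} w_{jj'} b_{jq} b_{j'q} \\
& + 2 \times \frac{1}{2} \sum_{q=1}^{Q} \sum_{j \in \mathcal{C}} \sum_{j' \in \mathcal{K}} w_{jj'} b_{jq} b_{j'q}.
\end{aligned} \tag{20}
$$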


The first term on the right-hand side of (20) is zero by the definition of $G_E$. Also, by the following
proposition, the second term equals zero as well, since any minimum-weighted partitioning
assigns exactly one cellular user to each cluster.

Proposition 2. Any minimum-weighted Q-way partitioning of the estimated network graph $G_E$
assigns exactly one cellular user to each cluster, that is, $L_q = 1$ $\forall q \in \mathcal{Q}$.

Proof: See Appendix VII-B.


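By Proposition 2 and comparing (19) with (20) we have

$$
\frac{1}{2} \sum_{q=1}^{Q} \mathbf{B}_q^{\mathsf{T}} \mathbf{W}_E \mathbf{B}_q = \sum_{q=1}^{Q} \sum_{j \in \mathcal{C}} \sum_{j' \in \mathcal{K}} w_{jj'} b_{jq} b_{j'q} = \sum_{q=1}^{Q} \sum_{j \in \mathcal{C}_q} \sum_{j' \in \mathcal{K}_q} w_{jj'}. \tag{21}
$$

By comparing (21) with (9) and by using the definition of $G_E$, it can be concluded that (9) is
equivalent to the minimum-weighted Q-way partitioning of $G_E$.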
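To make the clustering step concrete, the sketch below builds the D2D-to-cellular weights of the estimated network graph and minimizes (9) for a toy instance. It assumes that each cluster holds exactly one cellular user (Proposition 2) and that cluster sizes are otherwise unconstrained, in which case the objective separates per D2D user; the function names and the path gains are hypothetical, and the exhaustive search is included only to cross-check the result.

```python
from itertools import product

def estimated_weights(p_max, path_gain):
    """w[k][l] = p_max * g_kl, the D2D-to-cellular weights of the estimated graph."""
    return [[p_max * g for g in row] for row in path_gain]

def cluster_cost(weights, cluster_of):
    """Objective (9) when cluster q contains the single cellular user q."""
    return sum(weights[k][q] for k, q in enumerate(cluster_of))

def cluster_by_row_minimum(weights):
    """Each D2D user joins the cluster whose cellular user it disturbs least."""
    return [min(range(len(row)), key=row.__getitem__) for row in weights]

def cluster_brute_force(weights):
    """Exhaustive search over all assignments of D2D users to clusters."""
    n_clusters = len(weights[0])
    return min(product(range(n_clusters), repeat=len(weights)),
               key=lambda assignment: cluster_cost(weights, assignment))

# Hypothetical path gains g_kl for K = 3 D2D users and L = 2 cellular users.
path_gain = [[0.30, 0.10],
             [0.05, 0.20],
             [0.40, 0.25]]
weights = estimated_weights(6.0, path_gain)
clusters = cluster_by_row_minimum(weights)
cost = cluster_cost(weights, clusters)
```

Note that a cluster may end up without any D2D user under this rule, which is consistent with (16) as long as $K_q = 0$ is allowed.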
Remark 1. As described in Section II-A1, a D2D user refers to a pair of one single-antenna
transmitter and one single-antenna receiver. Also, as described before, after clustering, any
transmitter-receiver pair, which represents a D2D user, belongs to a single cluster. As a result,
i) no D2D transmitter communicates simultaneously with multiple receivers, and ii) no inter-cluster
communication takes place; that is, communication occurs only between devices in the
same cluster.


B. Some Notes on Complexity 

In principle, the proposed channel allocation scheme solves two problems, namely maximum-weighted
matching and minimum-weighted partitioning. The latter problem, however, can
itself be reformulated as a minimum-weighted matching, due to the special characteristics of the
defined estimated network graph. This is described formally in the following proposition.


Proposition 3. Define a bipartite graph $G'(\mathcal{V}, \mathcal{E})$ where $\mathcal{V}_1 = \mathcal{K}$ and $\mathcal{V}_2$ is produced by $K$
times replicating $\mathcal{C}$, i.e., $\mathcal{V}_2 = \underbrace{\mathcal{C} \cup \mathcal{C} \cup \cdots \cup \mathcal{C}}_{\times K}$. The weight of any edge connecting some D2D user
$k \in \mathcal{V}_1$ to each copy $l^{(j)} \in \mathcal{V}_2$ ($j \in \{1, \dots, K\}$) of some cellular user $l \in \mathcal{C}$ is $w_{kl}$, that is,
equal to the weight of the edge connecting $k$ and $l$ in the estimated network graph, $G_E$. Then
the minimum-weighted Q-way partitioning of $G_E$ is equivalent to a minimum-weighted bipartite
matching of $G'$.

Proof: See Appendix VII-C.
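The construction in Proposition 3 can be illustrated on a toy instance. The sketch below replicates the cellular users $K$ times and finds, by exhaustive search, a minimum-weight matching that covers every D2D user; this coverage requirement, the helper names and the weights are assumptions made for illustration.

```python
from itertools import permutations

def replicate_columns(weights):
    """Build the weight matrix of G': K rows, K copies of each cellular user.

    Column c of the result is a copy of cellular user c % L.
    """
    n_d2d, n_cell = len(weights), len(weights[0])
    return [[row[c % n_cell] for c in range(n_d2d * n_cell)] for row in weights]

def min_weight_matching(matrix):
    """Brute-force minimum-weight matching that covers every row."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    best = min(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(matrix[r][cols[r]] for r in range(n_rows)),
    )
    cost = sum(matrix[r][best[r]] for r in range(n_rows))
    return cost, list(best)

# Hypothetical weights w_kl for K = 3 D2D users and L = 2 cellular users.
weights = [[1.8, 0.6], [0.3, 1.2], [2.4, 1.5]]
replicated = replicate_columns(weights)
cost, columns = min_weight_matching(replicated)
# Map each matched copy back to the cellular user (cluster) it represents.
clusters = [c % len(weights[0]) for c in columns]
row_minima = sum(min(row) for row in weights)
```

Because every cellular user appears $K$ times, several D2D users can be matched to copies of the same cellular user, which is what allows a matching to represent a clustering.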


Therefore the algorithm is required to solve two (parallel) weighted matching problems.
Weighted matching is a classic graph-theoretical problem for which numerous efficient algorithmic
solutions exist. A well-known solution is the Hungarian algorithm [27]. For a bipartite
graph $G(\mathcal{V}, \mathcal{E})$, the space complexity of the Hungarian algorithm yields $O(V^2 E)$ with $V =
\max\{V_1, V_2\}$,^5 that is, polynomial in the number of vertices and also in the number of edges.
The running time is $O(V^3)$, which is also polynomial in the number of vertices. In our model,
for the first matching we have $V = L$ and $E = L^2$, by the definition of $G_L$.^6 For the second
matching, on the other hand, we have $V = KL$ and $E = (KL)^2$, by the definition of $G_E$ and
Proposition 3. Note that the two problems can be solved simultaneously; hence the running times
do not add up. More algorithmic solutions can be found in [28] and [42], for instance.

^5 In case $V_1 \neq V_2$, dummy vertices are added. See [?] for details.

^6 This number of edges corresponds to the worst-case scenario where the bipartite graph is complete, i.e., there exists an edge
between any pair $x \in \mathcal{V}_1$ and $y \in \mathcal{V}_2$.


C. Quality of Service Guarantee and Fairness 

Despite being suboptimal, the decoupling approach described in Section III-A provides the
possibility of solving the channel allocation problem efficiently under different constraints. Two
examples are given below.

• Quality of service (QoS) requirement for cellular users: By problem (5), the goal of channel
allocation is to provide every D2D user with some transmission channel in a way that the
aggregate utility of cellular users is maximized, thereby ignoring the individual performances
of cellular users. In many networks, however, cellular users require some specific QoS that
restricts the amount of tolerable interference. Assume that each cellular user $l$ requires some
minimum utility, $R_{l,\min}$, by which its QoS is guaranteed. After solving problem (8), each
cellular user is assigned a channel. Therefore, the numerator of (1) is known. As a result,
the maximum tolerable interference of each cellular user $l$, say $I_{l,\max}$, can be calculated
based on $R_{l,\min}$. We construct a bipartite graph with $\mathcal{V}_1 = \mathcal{K}$ and $\mathcal{V}_2 = \mathcal{C}$. The problem
is then to assign as many D2D users as possible to cellular users (thus to channels) so
that no interference experienced by any cellular user exceeds the maximum tolerable value.
Formally, the problem is to find a $K \times L$ assignment matrix $\mathbf{X} = [x_{kl}]$ so as to

$$
\text{maximize} \quad \sum_{l=1}^{L} \sum_{k=1}^{K} x_{kl} \tag{22}
$$

subject to the following constraints:

$$
\sum_{k \in \mathcal{K}} x_{kl} w_{kl} \leq I_{l,\max}, \quad \forall l \in \mathcal{C}, \tag{23}
$$

$$
\sum_{l=1}^{L} x_{kl} \leq 1, \quad \forall k \in \mathcal{K}, \tag{24}
$$

and

$$
x_{kl} \in \{0, 1\}, \quad \forall\, l, k. \tag{25}
$$

Note that by the definition of the estimated network graph, $w_{kl} = p_d^{(M)} g_{kl}$, i.e., it is an upper-bound
of the interference experienced by cellular user $l$ due to D2D user $k$. This problem
is known as the generalized assignment problem, which is NP-hard; nonetheless, efficient
approximate solutions exist. See [43] as an example.

• Fairness requirement: Here the problem is similar to the partitioning problem described in
Section III-A2, with the additional requirement that the resulting clusters are balanced, in the
sense that the interferences experienced by cellular users due to D2D users are almost equal.
Formally, the aim is to solve (9), subject to (16), (17) and (18), so that
$\sum_{k \in \mathcal{K}_1} w_{kl} \approx \cdots \approx \sum_{k \in \mathcal{K}_Q} w_{kl}$, where in each sum $l$ denotes the cellular user
of the corresponding cluster. It should be emphasized that in this context,
the burden of D2D communication is divided (almost) equally among cellular users, which
does not necessarily result in achieving equal utilities by all of them.
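As an illustration of the QoS-constrained variant in (22)-(25), the following sketch applies a simple greedy rule: candidate pairs are visited in order of increasing worst-case interference, and a D2D user is admitted whenever the remaining interference budget of the cellular user allows it. This is a plain heuristic without any optimality guarantee and is not the approximation algorithm referenced above; the function name, the budgets and the weights are hypothetical.

```python
def greedy_qos_assignment(weights, budgets):
    """Greedy heuristic for (22)-(25).

    weights[k][l] -- worst-case interference w_kl of D2D user k on cellular user l
    budgets[l]    -- maximum tolerable interference of cellular user l
    Returns a list with assignment[k] = l, or None if user k is not admitted.
    """
    remaining = list(budgets)
    assignment = [None] * len(weights)
    # Visit (weight, D2D user, cellular user) triples from least to most harmful.
    pairs = sorted(
        (w, k, l) for k, row in enumerate(weights) for l, w in enumerate(row)
    )
    for w, k, l in pairs:
        if assignment[k] is None and w <= remaining[l]:
            assignment[k] = l
            remaining[l] -= w
    return assignment

# Hypothetical weights for K = 3 D2D users and L = 2 cellular users.
weights = [[1.8, 0.6], [0.3, 1.2], [2.4, 1.5]]
budgets = [2.0, 1.0]
assignment = greedy_qos_assignment(weights, budgets)
admitted = sum(1 for l in assignment if l is not None)
```

In this toy instance the third D2D user is left without a channel, because admitting it would violate constraint (23) for either cellular user.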

IV. Power Control 

This section deals with the second step of resource assignment, i.e., D2D power control, which 
aims at optimizing the performance of D2D users. 

A. Power Control Game 

As described in the foregoing section, while performing the channel assignment, the BS ignores 
the potential interferences that might arise among D2D users, due to the lack of information and 
also their lower priority. In essence, D2D users are partitioned into clusters and each cluster is 
assigned a single channel. Given no information, each D2D user therefore intends to maximize 
its own utility, thereby causing interference to the users with whom it shares a channel. By power 
control, however, interference can be managed so that the channel assigned to each cluster is 
utilized efficiently. We model the power control problem as a game with incomplete information, 
defined on a discrete strategy set. We show that the game is a potential game and characterize the set of 
Nash equilibria. To this end, we define (exact) potential games [44] and Nash equilibrium [45]. 


Definition 5 (Potential Game). Consider a strategic game $\mathcal{G} = \{\mathcal{K}, \mathcal{X}, \{R_k\}_{k \in \mathcal{K}}\}$, where $\mathcal{K}$ 
is the set of $K$ players, $\mathcal{X}$ is the set of pure-strategy joint action profiles of all players, and 
$R_k : \mathcal{X} \to \mathbb{R}_+$ denotes the payoff function of player $k$. Then $\mathcal{G}$ is an exact potential game if 
there exists a function $v : \mathcal{X} \to \mathbb{R}_+$ such that for all $k \in \mathcal{K}$ we have 

$$R_k(i_k, \mathbf{i}_{-k}) - R_k(i'_k, \mathbf{i}_{-k}) = v(i_k, \mathbf{i}_{-k}) - v(i'_k, \mathbf{i}_{-k}), \tag{26}$$

where $i_k$ is the action of player $k$ while $\mathbf{i}_{-k}$ denotes the joint action profile of all players except 
for $k$. Any such function $v$ is called a potential of $\mathcal{G}$. 

Definition 6 (Nash equilibrium). A joint strategy profile $\mathbf{i} = (i_1, \dots, i_K)$ is called a pure- 
strategy Nash equilibrium if for all $k \in \mathcal{K}$, and all actions $i'_k$, the joint strategy profile $\mathbf{i}' = 
(i_1, \dots, i'_k, \dots, i_K)$ yields $R_k(\mathbf{i}') \le R_k(\mathbf{i})$. 


As clusters are assigned orthogonal channels, the actions of D2D users inside any given cluster 
do not affect the utilities of the users outside that cluster. Therefore the power allocation problem 
in any cluster $q \in \{1, \dots, Q\}$ can be defined as a game among $K_q$ D2D users. 

Definition 7 (Cluster Power Allocation Game). The power allocation game of cluster $q \in \{1, \dots, Q\}$ 
is a strategic game defined as $\mathcal{G}_q = \{\mathcal{K}_q, \mathcal{X}, \{R_k\}_{k \in \mathcal{K}_q}\}$, where $\mathcal{K}_q$ is the set of 
D2D users assigned to channel $q$, $\mathcal{X}$ is the set of joint actions with 
realizations $\mathbf{p}_{d,q} = (p_1, \dots, p_{K_q})$, and $R_k : \mathcal{X} \to \mathbb{R}_+$ is the payoff function of player $k \in \{1, \dots, K_q\}$ 
defined in Section II. 

The main difference between the cluster power allocation game and the standard power 
control games investigated in other studies, including [34], is that the strategy set of players 
is here extracted from a discrete space, while in the previous contributions the strategy space 
is continuous. Consequently, most of the existing results do not hold, and hence we proceed to 
the following theorem. 


Theorem 1. a) The cluster power allocation game (Definition 7) is an exact potential game 
with potential 

$$v(\mathbf{p}_{d,q}) = \sum_{k \in \mathcal{K}_q} \log(p_k) - \sum_{k \in \mathcal{K}_q} c\, p_k. \tag{27}$$

b) Denote the set of potential maximizers by $\mathcal{V}_{\max}$. Then, a joint action profile $\mathbf{p}_{d,q}$ is a Nash 
equilibrium if and only if $\mathbf{p}_{d,q} \in \mathcal{V}_{\max}$. 

Proof: See Appendix VII-D. ■ 
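
As an informal sanity check of part a), the sketch below enumerates every unilateral deviation in a small cluster and compares the change in reward with the change in the potential (27). It assumes a reward of the form $R_k = \log(\gamma_k) - c\,p_k$, where the SIR $\gamma_k$ has the player's own power in the numerator only; the gain matrix `h`, the `background` term (noise plus BS interference) and all numerical values are hypothetical.

```python
import itertools
import math


def reward(k, p, h, background, c):
    """R_k = log(SIR_k) - c * p_k, with the SIR seen at the receiver of player k."""
    interference = sum(p[j] * h[j][k] for j in range(len(p)) if j != k)
    sir = p[k] * h[k][k] / (background[k] + interference)
    return math.log(sir) - c * p[k]


def potential(p, c):
    """Candidate potential (27): sum of log(p_k) - c * p_k over the cluster."""
    return sum(math.log(pk) - c * pk for pk in p)


def is_exact_potential(levels, h, background, c, tol=1e-9):
    """Check (26) for every joint profile and every unilateral deviation."""
    n = len(h)
    for p in itertools.product(levels, repeat=n):
        for k in range(n):
            for alt in levels:
                q = p[:k] + (alt,) + p[k + 1:]
                gain_reward = reward(k, p, h, background, c) - reward(k, q, h, background, c)
                gain_potential = potential(p, c) - potential(q, c)
                if abs(gain_reward - gain_potential) > tol:
                    return False
    return True


# Hypothetical three-user cluster; h[j][k] is the gain from transmitter j to receiver k.
h = [[0.9, 0.1, 0.2], [0.15, 0.8, 0.1], [0.05, 0.2, 0.7]]
background = [1.2, 1.1, 1.3]  # noise plus BS interference at each receiver (made-up values)
print(is_exact_potential((2, 4), h, background, c=0.1))
```

The check passes for any gains because the interference term in the denominator of $\gamma_k$ does not depend on player $k$'s own power, so it cancels in the reward difference.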


1) Quality of Service Guarantee: In Definition 7 we assume that D2D users have no strict 
QoS requirement, and only aim at maximizing some reward, expressed in terms of SIR and 
cost. As a result, the set of joint strategies is $\mathcal{X} = \prod_{k=1}^{K_q} \{p_d^{(1)}, p_d^{(2)}, \dots, p_d^{(M)}\}$. While this 
formulation holds for many problems, there are some cases where D2D users need to meet 
specific QoS requirements, expressed for instance in terms of a minimum SIR value. In such 
scenarios, each player tries to selfishly solve the following problem: 

$$\underset{p_k \in \mathcal{A}_k(\mathbf{p}_{-k})}{\text{minimize}} \quad p_k \tag{28}$$

where $\mathcal{A}_k$ is the set of strategies for player $k$, which depends on the joint strategy profile of its 
opponents, $\mathbf{p}_{-k}$, and is given by 

$$\mathcal{A}_k = \{p_k \in \mathcal{M} : \gamma_k \ge \Gamma_k\}. \tag{29}$$

Here $\Gamma_k$ is the minimum required SIR for D2D user $k$ to meet its QoS target. In other words, 
the players' strategy sets are correlated so that any player plays only the actions that satisfy 
its QoS constraint, given the actions of opponents. It is known that the problem in (28) can be 
modeled as a strategic game, where the utility of each player $k$ is defined as $R_k = -p_k$ 
or $R_k = -\log(p_k)$ [34]. Along similar lines to Theorem 1, it is straightforward to show that 
the game is an exact potential game with potential $v(\mathbf{p}_{d,q}) = \sum_{k \in \mathcal{K}_q} R_k(\mathbf{p}_{d,q})$, provided that the 
original problem (28) is feasible. 



B. Q-Learning Better-Reply Dynamics 

According to the system model, in the cluster power allocation game (Definition 7), the utility 
functions are not known by players (D2D users) in advance. Therefore they need to interact 
with the environment in order to i) learn the reward functions, and ii) achieve equilibrium. We 
consider the cluster power allocation game to be a game with noisy payoffs. In such games, for 
each joint action profile $\mathbf{i} \in \mathcal{X}$ of players, the utility achieved by player $k$ at each interaction 
can be written as $R_k = \bar{R}_k(\mathbf{i}) + e_k$, where $\bar{R}_k$ is the true expected value of the utility function 
$R_k$ and $e_k$ is a random fluctuation with zero mean and bounded variance, independent of 
all other random variables. During the learning process, each player faces a trade-off between 
gathering information (learning) on the one hand and using information to achieve higher utility 
(control) on the other hand. This trade-off is known as the exploration-exploitation dilemma. In order 
to deal with this dilemma and also to achieve equilibrium in a distributed manner, we use 
Q-learning better-reply dynamics [32]. This strategy consists of three main steps that are performed 
recursively: 1) Observe the personal reward and also the actions of opponents.⁷ 2) Update the 
Q-values of the played joint action profile. 3) With a small probability, $\epsilon \ll 1$, select an action 
uniformly at random, while with a large probability, $1 - \epsilon$, play according to the better-reply 
dynamics that is described in the following definition. 


Definition 8 (Better-Reply Dynamics [32]). Assume that at some trial $t - 1$, a player $k$ plays 
action $p_{k,t-1}$. Then, at trial $t$, with probability $\zeta_k$, the player selects the same action as in 
the previous trial, $t - 1$, i.e., $p_{k,t} = p_{k,t-1}$. With probability $1 - \zeta_k$, however, the player selects an 
action according to a distribution that puts positive probability only on actions that are better 
replies to its (finite) memory than $p_{k,t-1}$. For instance, it selects an action according to a uniform 
distribution over all better replies. 

For readers' convenience, the detailed strategy is described in Algorithm 1 for some player 
$k \in \mathcal{K}_q$. 


Theorem 2 ([32]). The Q-learning better-reply dynamics (Algorithm 1), with $\epsilon_t$ and $\lambda_t$ given by 
(30) and (32) respectively, converges to a pure Nash equilibrium in games with noisy unknown 
rewards that are generic and admit a potential function. 

Corollary 1. By using Q-learning better-reply dynamics, the cluster power allocation game 
(Definition 7) converges to a pure Nash equilibrium that maximizes the potential function. 

Proof: The proof directly follows from Theorem 1 and Theorem 2. ■ 

Remark 2. Let $a$ denote the size of the normal form representation of the cluster 
power allocation game. Similar to any other equilibrium-learning strategy, Algorithm 1 follows 
a better-reply path to a pure Nash equilibrium, whose length grows exponentially in $a$. On 
the other hand, as for Q-learning, the Q-values of all joint action profiles (whose number is equal to $a$) 
must be learned. As a result, the running time is at least exponential in the size of the game, 


⁷ When using multi-agent Q-learning algorithms, conventionally it is assumed that every agent observes the state of the 
environment and/or the actions of its opponents [46]. In our model, players are therefore required to announce their transmit 
powers, for example by broadcasting in a specific time period, borrowed from the total transmission time. This overhead, however, 
is much less than that of the frequent and pairwise data exchange, for which usually a control channel is allocated [47]. The 
reason is that after convergence, which is achieved relatively fast, the transmit powers of players remain fixed. Therefore no 
more broadcasting is required and the borrowed time period is again available for data transmission. We also assume that the 
players have a finite memory of length m; that is, at each trial, each player remembers the played joint action profiles of exactly 
m past trials. 

Algorithm 1 Q-Learning Better-Reply Dynamics [32] 

1: Select arbitrary positive constants $c_\lambda$ and $c_\epsilon$. 
2: Select learning parameters $\rho_\lambda \in (1/2, 1]$. 
3: Let $\delta_{k,t}$ be the mixed strategy of player $k$ at time $t$. Let $\delta_{k,1}$ be the uniform distribution over all actions (power 
levels). 
4: Select an action, $p_{k,1}$, using $\delta_{k,1}$. Play and observe the reward. 
5: for $t = 2, \dots, T$ do 
Let 


St = CeV 


(30) 


7: • With probability $\epsilon_t$, let $\delta_{k,t}$ be the uniform distribution over all actions. 
   • With probability $1 - \epsilon_t$, perform the following (better-reply dynamics): 
     - With probability $\zeta_k$, let $\delta_{k,t}$ be the Dirac probability distribution on $p_{k,t-1}$. 
     - With probability $1 - \zeta_k$, let $\delta_{k,t}$ be the uniform distribution over all actions that are better replies to 
       the full (finite) memory than $p_{k,t-1}$. 
8: Using $\delta_{k,t}$, select the action of time $t$, $p_{k,t}$, and play. 
9: Announce the selected action. Moreover, observe the played joint action profile of other players, $\mathbf{p}_{-k,t}$, and 
   also the achieved reward, $R_k(\mathbf{p}_{d,q}^{(t)})$, where $\mathbf{p}_{d,q}^{(t)} = (p_{k,t}, \mathbf{p}_{-k,t}) = (p_{1,t}, \dots, p_{k,t}, \dots, p_{K_q,t})$. 
10: Update the Q-value of the played joint action profile as 

$$Q_{k,t+1}(\mathbf{p}_{d,q}) = Q_{k,t}(\mathbf{p}_{d,q}) + \lambda_t \left( R_k(\mathbf{p}_{d,q}^{(t)}) - Q_{k,t}(\mathbf{p}_{d,q}) \right) \mathbb{1}_{\{\mathbf{p}_{d,q} = \mathbf{p}_{d,q}^{(t)}\}}, \tag{31}$$

    with 

$$\lambda_t = \left( c_\lambda + \#^t[\mathbf{p}_{d,q}] \right)^{-\rho_\lambda}, \tag{32}$$

    where $\#^t[\mathbf{p}_{d,q}]$ denotes the number of trials in which $\mathbf{p}_{d,q}$ is played, while $\mathbb{1}_{\{\cdot\}}$ is the indicator function. 
11: end for 



i.e., $O(c^{a})$ for some constant $c > 1$. Thus, for a specific number of players (which is determined 
by clustering), smaller M (number of power levels) yields faster convergence, as one expects 
intuitively. Similarly, smaller M yields lower computational complexity. 
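
The learning loop can be sketched in a few lines of Python. This is a simplified illustration, not a faithful reimplementation of Algorithm 1: it uses a memory of a single trial, compares learned Q-values against the last joint action to decide what counts as a better reply, and assumes the exploration schedule `c_eps * t ** (-1 / n_players)`. The function name, its parameters and that schedule are assumptions made for this sketch.

```python
import itertools
import random


def q_learning_better_reply(reward_fn, n_players, levels, horizon,
                            zeta=0.5, c_eps=1.0, c_lam=1.0, rho=0.75, seed=0):
    """Simplified Q-learning better-reply dynamics with a memory of one trial.

    reward_fn(k, joint, rng) returns the noisy reward of player k for a joint action.
    The exploration schedule below is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    profiles = list(itertools.product(levels, repeat=n_players))
    q = [{p: 0.0 for p in profiles} for _ in range(n_players)]  # one Q-table per player
    visits = {p: 0 for p in profiles}
    prev = tuple(rng.choice(levels) for _ in range(n_players))
    history = []
    for t in range(1, horizon + 1):
        eps = min(1.0, c_eps * t ** (-1.0 / n_players))  # assumed decaying exploration rate
        joint = []
        for k in range(n_players):
            if rng.random() < eps:
                joint.append(rng.choice(levels))  # explore uniformly
                continue
            # Better replies: actions whose learned value against the last joint action is higher.
            better = [a for a in levels
                      if q[k][prev[:k] + (a,) + prev[k + 1:]] > q[k][prev]]
            if better and rng.random() >= zeta:
                joint.append(rng.choice(better))  # move to a better reply
            else:
                joint.append(prev[k])             # inertia: repeat the previous action
        joint = tuple(joint)
        visits[joint] += 1
        lam = (c_lam + visits[joint]) ** (-rho)   # step size for the played profile
        for k in range(n_players):
            q[k][joint] += lam * (reward_fn(k, joint, rng) - q[k][joint])
        history.append(joint)
        prev = joint
    return history, q
```

Each player stores one Q-value per joint action profile, so the table has $M^{K_q}$ entries per player for $M$ power levels and $K_q$ players; this makes the dependence on the game size discussed in Remark 2 visible in the code.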

Remark 3. As described before, in any game, complexity and convergence speed to equilibrium 
depend dramatically on the size of the game. This dependency becomes even stronger for games 
with incomplete information, as the reward of all joint action profiles must be learned through 
successive interactions. As a result, it is of utmost importance to reduce the size of the game 
and/or to use any available information. The designed two-stage resource allocation mechanism 
strictly follows this policy, as by excluding cellular users from the set of players, and channels 
from the set of actions, the game size reduces sharply in comparison with a one-stage game, 
while the available information at the BS is used efficiently. Additionally, it allows taking the 
priority of cellular users into account, which is not possible in a one-stage game. 

C. Efficiency of Equilibrium 

According to Theorem 1, for the cluster power allocation game, any pure-strategy Nash 
equilibrium maximizes the potential function, given by (27). It should however be noted that here 
the potential function is not equal to the social welfare, $f(\mathbf{p}_{d,q}) = \sum_{k=1}^{K_q} R_k(\mathbf{p}_{d,q})$. Therefore, the 
pure-strategy Nash equilibrium does not necessarily maximize the sum utility of all players, 
although such a solution is desired. The inefficiency of equilibrium is formalized by the price of 
stability, defined below. 


Definition 9 (Price of Stability [49]). Let $f(\mathbf{p}_{d,q})$ be an objective function such as social welfare, 
which we wish to maximize. Moreover, let $\mathcal{N}$ denote the set of pure Nash equilibria of the 
cluster power allocation game. Then the price of stability (PoS) is defined as 

$$\mathrm{PoS} = \frac{\max_{\mathbf{p}_{d,q} \in \mathcal{X}} f(\mathbf{p}_{d,q})}{\max_{\mathbf{p}_{d,q} \in \mathcal{N}} f(\mathbf{p}_{d,q})}. \tag{33}$$

Note that the objective function to be optimized and the solution set being evaluated might 
vary. For instance, the objective function could be the minimum reward (so that the optimization 
problem corresponds to the max-min fairness criterion), or the set of solutions might also include 
mixed-strategy equilibria. The following proposition provides an upper bound for the inefficiency 
of the pure-strategy Nash equilibrium in the cluster power allocation game. 


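Proposition 4. For the cluster power allocation game described in Definition 7, define 

$$\gamma_{\min} := \min_{k \in \mathcal{K}_q} \frac{p_d^{(1)} h_{kk',q}}{1 + \sum_{j \in \mathcal{K}_q, j \ne k} p_d^{(M)} h_{jk',q} + p_c h_{bk',q}}. \tag{34}$$

Then we have $1 \le \mathrm{PoS} \le \frac{\log(p_d^{(M)})}{\log(\gamma_{\min})}$. 

Proof: See Appendix VII-E. ■ 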

Although the bound provided by Proposition 4 is loose in general, it clearly shows that a larger 
value of $p_d^{(M)}$ (range of the set of power levels, $\mathcal{M}$) may yield higher inefficiency of the pure 
Nash equilibrium. Recall that a large range of $\mathcal{M}$ also has an adverse effect on the lower bound 
on the cellular aggregate utility (see Appendix VII-A). Therefore the two-stage resource allocation mechanism is particularly suitable 
for $\mathcal{M}$ with small ranges. It is worth mentioning that for games with multiple equilibria, 
the inefficiency of the worst Nash equilibrium is formalized by the price of anarchy (PoA) [50]. 
Calculating the PoA is mathematically involved and lies outside the scope of this paper. 

Fig. 1. Network model consisting of D2D transmitters ($D_i$, $i \in \{1, \dots, 12\}$) and cellular receivers ($C_i$, $i \in \{1, \dots, 5\}$). 


V. Numerical Analysis 

We consider an underlay D2D communication system, consisting of twelve D2D users ($K = 12$) 
and five cellular users ($L = 5$), as depicted in Figure 1. Note that only the transmitter side 
of D2D users is shown in the figure, as receivers do not cause any interference to cellular users 
and therefore do not impact the channel allocation (see also the definition of the estimated network 
graph in Section III-A2). Also note that for numerical analysis, the locations of cellular and D2D 
users, as well as channel gains, are selected randomly. According to the system model (Section 
II-A1), there exist five orthogonal channels ($Q = 5$). Each D2D user $k \in \mathcal{K}$ selects a transmit 
power from the set of power levels, $\mathcal{M} = \{2, 4\}$. Moreover, the transmit power of the BS to the 
cellular users is $p_c = 7$. 

A. Channel Allocation 

Table I includes $h_{bl,q}$ (BS to cellular average channel gains) for $l, q \in \{1, \dots, 5\}$, which is assumed 
to be known by the BS together with the network topology (Figure 1), according to Assumption 
A1 (Section III). Based on this information and by using the graph-theoretical channel allocation 
scheme described in Section III, the BS assigns each (cellular and D2D) user a channel, as 
summarized in Table II(a). Based on Table I and Figure 1, it can be concluded that by the 
channel allocation given in Table II(a), both required conditions are satisfied. 



TABLE I 
BS TO CELLULAR AVERAGE CHANNEL GAINS 

User \ Channel    1       2       3       4       5 
C1                0.04    0.01    0.27    0.12    0.04 
C2                0.29    0.06    0.15    0.18    0.26 
C3                0.31    0.46    0.24    0.19    0.06 
C4                0.12    0.06    0.29    0.34    0.16 
C5                0.24    0.08    0.23    0.41    0.07 



As discussed in Section III-C, it is also possible to change the criterion of channel allocation 
from maximizing the social welfare to addressing the QoS guarantee or fairness issues (of cellular 
users). Assume that the required QoS of any cellular user $l \in \mathcal{L}$ is satisfied if it achieves some 
minimum utility, say 3.5.⁸ Therefore, by using the data given in Table I, the maximum 
tolerable interference of each cellular user can be simply calculated. A channel allocation that 
guarantees the QoS satisfaction of all cellular users is summarized in Table II(b). Moreover, the 
result of channel assignment based on fairness among cellular users is shown in Table II(c).⁹ 

The achieved average rewards of cellular users under all three criteria are shown in Figure 2. 
It can be seen that to achieve the highest utility sum, some cellular users do not experience 
any interference, while some others are strongly disturbed. In case of QoS guarantee, users with 
higher channel gains experience more interference and vice versa, so that in the end all cellular 
users are satisfied. Moreover, by Table II(b), in the current setting, all D2D users can be served 
without violating the QoS requirement of cellular users.¹⁰ In the last criterion, all cellular users 
experience almost equal amounts of interference, regardless of their achieved utilities. 

⁸ Note that the QoS requirements of cellular users do not necessarily need to be similar. 

⁹ Note that the solutions are approximately optimal and also not unique. 

¹⁰ Clearly, this might not always be the case. In fact, given a specific QoS requirement of cellular users, the number of D2D 
users that can be served depends strongly on network topology, channel quality and the required QoS. 

TABLE II 

(a) Maximum Aggregate Utility 

Channel    Users 
1          C5, D3, D9 
2          C3, D1, D2, D11, D12 
3          C1, D8, D10 
4          C4, D6 
5          C2, D4 

(b) QoS guarantee 

Channel    Users 
1          C5, D1, D3, D9 
2          C3, D2, D6, D7, D12 
3          C1 
4          C4 
5          C2, D4, D5, D8, D10, D11 

(c) Fairness 

Channel    Users 
1          C5, D3, D9, D11 
2          C3, D2, D12 
3          C1, D1, D8 
4          C4, D6, D7 
5          C2, D4, D5, D10 

Fig. 2. Average utility and interference experienced by cellular users under three criteria (S: Sum). 


For our primary channel allocation criterion, i.e., maximizing the aggregate utility of cellular 
users, it is of interest to investigate the performance loss of cellular users caused by sharing 
resources with D2D users. The performance degradation is shown in Figure 3, where the 
achievable utilities of cellular users without any interference (no channel sharing) are shown 
in comparison with the case where all D2D users are assigned some channel. From this figure, 
it can be concluded that in the current setting, serving all D2D users costs cellular users a 
performance loss of approximately 15%. 

Fig. 3. Performance loss of cellular users due to channel sharing with the allocation criterion being the maximization of cellular 
utility sum (horizontal axis: cellular user index). 


B. Power Control 


From Table 11(a) it can be observed that minimum-weighted partitioning divides the D2D and 
cellular users into five clusters, each allocated a frequency channel. In this section, we investigate 
the power control game of the first cluster, i.e., the cluster that includes three D2D users (D1, 
D3 and D9) and is assigned channel one. The games of other clusters are similar. The game 
horizon and price factor are considered to be T = 2 x 10^ and c = 0.1, respectively. The joint 


action profiles of the three users as well as their average rewards are given in Table III. From 
this table, the action profile (4,4,4) is the unique Nash equilibrium, which 
maximizes the potential function. Hence the game converges theoretically to this point. Figure 4 
describes the frequency with which any given action is played by each D2D user. It can be seen 
that the equilibrium strategy is played almost all the time. Figure 5 depicts the average utility 
of D2D users versus the equilibrium reward, confirming that in a short time the average reward 
of every player converges to that of the equilibrium point. 

TABLE III 
Joint Reward Table 

Joint Action    Joint Reward        Joint Action    Joint Reward 
(2,2,2)         (2.60,2.36,2.10)    (4,4,2)         (2.80,2.54,0.30) 
(2,4,2)         (1.80,3.36,1.30)    (2,4,4)         (1.22,2.54,2.28) 
(4,2,2)         (3.58,1.56,1.28)    (4,2,4)         (2.80,0.98,2.28) 
(2,2,4)         (1.80,1.56,3.08)    (4,4,4)         (2.20,1.98,1.90) 

Fig. 4. Fraction of trials in which any given action is played by D2D users. 

Fig. 5. Utilities achieved by D2D users versus utility values at equilibrium. 
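
The equilibrium claim can be checked directly from Table III. The sketch below is an illustration: it enumerates unilateral deviations to find the pure Nash equilibria and then evaluates the price of stability of Definition 9 with social welfare (the sum of the three rewards) as the objective. The function names are illustrative; the reward values are copied from Table III.

```python
LEVELS = (2, 4)

# Average rewards from Table III, keyed by the joint action of the three D2D users.
REWARDS = {
    (2, 2, 2): (2.60, 2.36, 2.10), (4, 4, 2): (2.80, 2.54, 0.30),
    (2, 4, 2): (1.80, 3.36, 1.30), (2, 4, 4): (1.22, 2.54, 2.28),
    (4, 2, 2): (3.58, 1.56, 1.28), (4, 2, 4): (2.80, 0.98, 2.28),
    (2, 2, 4): (1.80, 1.56, 3.08), (4, 4, 4): (2.20, 1.98, 1.90),
}


def pure_nash_equilibria(rewards, levels):
    """Return all joint actions from which no player gains by deviating alone."""
    equilibria = []
    for profile, payoff in rewards.items():
        stable = True
        for k in range(len(profile)):
            for alt in levels:
                deviation = profile[:k] + (alt,) + profile[k + 1:]
                if rewards[deviation][k] > payoff[k]:
                    stable = False
        if stable:
            equilibria.append(profile)
    return equilibria


def price_of_stability(rewards, equilibria):
    """Best achievable social welfare divided by the best welfare over the equilibria."""
    welfare = {p: sum(r) for p, r in rewards.items()}
    return max(welfare.values()) / max(welfare[p] for p in equilibria)


equilibria = pure_nash_equilibria(REWARDS, LEVELS)
print(equilibria, round(price_of_stability(REWARDS, equilibria), 3))
```

With the tabulated values, (4,4,4) is the only profile that survives the deviation test. Its welfare is 6.08, whereas (2,2,2) attains 7.06, so the price of stability for this cluster is about 1.16, which illustrates the gap between potential maximization and social welfare discussed in Section IV-C.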

C. Overall Performance 

In order to evaluate the overall performance of the proposed resource allocation scheme, we 
compare it with three other strategies that are described below. 

• Centralized approach that is based on exhaustive search given global information. In 
accordance with the concept of underlay D2D networks, priority is here granted to the 
cellular users. Formally, the selected joint channel and power allocation vector maximizes 
the aggregate cellular utility, with ties broken in favor of the allocation vector that yields higher aggregate 
D2D utility, i.e., larger $\sum_{k=1}^{K} R_k$. 

• Centralized approach that is based on exhaustive search given global information, but 
without considering the priority of cellular users. Formally, the algorithm searches for the 
joint channel and power allocation vector that maximizes the sum of the aggregate cellular utility and the aggregate D2D utility. 

• Random resource allocation, where the channel and power levels are assigned using a uniform 
distribution. 

Fig. 6. Overall performance of the proposed scheme compared to some other strategies. 

As applying the exhaustive search approach to the large network investigated before (Figure 1) 
yields excessive complexity, since the number of joint channel and power allocations to be searched 
grows exponentially with the number of users, we turn to a smaller network 
with L = Q = M = 2 and K = 6. Ten experiments are performed. For each experiment, 
independent from others, average channel gains and users’ locations are selected randomly. In 
other words, ten random simulation settings are selected. For each experiment, the sum of average 
rewards of all (cellular and D2D) users is simulated over T = 10^ trials. Results are depicted in 
Figure 6. From this figure, it can be concluded that the utility achieved by our proposed resource 
allocation scheme is almost equal to the highest possible aggregate network utility when taking 
the priority of cellular users into account. Note that the difference is due to i) the bounding and 
decomposition techniques that are used in Section III and ii) the inefficiency of equilibrium that 
is described in Section IV. Hence the performance gap is in fact the cost of i) absence of a 
coordinator, ii) lack of information, and iii) low time and computational complexity tolerance. It 
is also worth noting that larger network utility sum can be achieved by neglecting cellular priority; 
nevertheless, such setting does not comply with the concept of underlay D2D communication, 
since cellular users might be extremely disturbed. It is also worth mentioning that for larger 
number of D2D and cellular users, the number of possible channel and power allocation vectors 
grows exponentially, and hence centralized resource allocation based on exhaustive search yields 
excessive cost in terms of time and computational complexity, as well as a large overhead that 
is required for information acquisition. Our approach, in contrast, offers low complexity and 
overhead; hence it is specifically suitable for large networks. 


VI. Conclusion and remarks 

We studied an underlay D2D communication system, and proposed a two-stage resource 
allocation strategy that takes the priority of cellular users into account, and relies on strictly 
limited information. In the first stage, centralized channel allocation is performed by using a 
graph-theoretical method. The method offers high flexibility for selecting the allocation criteria, 
for instance aggregate utility, fairness or QoS guarantee. The complexity was shown to be 
polynomial in the number of users. In the second stage, the power control problem is modeled 
as a game with incomplete information. We showed that the game is an exact potential game 
defined on a discrete strategy set, and therefore Q-learning better-reply dynamics can be used 
by players to achieve a pure-strategy Nash equilibrium in a distributed manner. The set of Nash 
equilibria was shown to be equivalent to the set of potential maximizers, and the inefficiency of 
Nash equilibrium was discussed. Extensive numerical analysis demonstrated the applicability of 
our approach, specifically in the context of large-scale networks. Moreover, the results showed 
that the number of D2D users that can be served depends on QoS requirement of cellular users. 
If no QoS requirement exists, serving all D2D users causes degradation of the cellular aggregate 
utility, depending on the channel qualities as well as the number of D2D users. In addition, 
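it was concluded that using Q-learning better-reply dynamics results in a fast convergence to 
equilibrium. 

VII. Appendix 

A. Proof of Proposition 1 

According to our system model, $p_k \le p_d^{(M)}$ $\forall k \in \mathcal{K}$. Moreover, $h_{uv,q} = f_{uv,q}\, g_{uv}$ with 
$0 < f_{uv,q} \le 1$ and $0 < g_{uv} \le 1$. Hence, 

$$\sum_{q=1}^{Q} \sum_{l \in \mathcal{L}_q} \log\left( \frac{p_c h_{bl,q}}{1 + \sum_{k \in \mathcal{K}_q} p_k h_{kl,q}} \right) \ge \sum_{q=1}^{Q} \sum_{l \in \mathcal{L}_q} \log\left( \frac{p_c h_{bl,q}}{1 + \sum_{k \in \mathcal{K}_q} p_d^{(M)} g_{kl}} \right). \tag{35}$$

By basic properties of the logarithm, the right-hand side of (35) can be written as 

$$\sum_{q=1}^{Q} \sum_{l \in \mathcal{L}_q} \log(p_c h_{bl,q}) - \sum_{q=1}^{Q} \sum_{l \in \mathcal{L}_q} \log\left( 1 + \sum_{k \in \mathcal{K}_q} p_d^{(M)} g_{kl} \right) \ge \sum_{q=1}^{Q} \sum_{l \in \mathcal{L}_q} \log(p_c h_{bl,q}) - \sum_{q=1}^{Q} \sum_{l \in \mathcal{L}_q} \sum_{k \in \mathcal{K}_q} p_d^{(M)} g_{kl}, \tag{36}$$

where the inequality follows from the standard logarithm inequality, $\frac{a}{1+a} \le \log(1 + a) \le a$, $\forall a > -1$ [51]. 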


B. Proof of Proposition 2 

We proceed by contraposition, i.e., we show that if $\{q \in \mathcal{Q} \mid L_q \ne 1\} \ne \emptyset$ then the partitioning 
is suboptimal. 

Let $\mathcal{C}$ be the set of all possible $Q$-way partitioning forms of the $L + K$ vertices of $G_E$. Assume 
that there exists some partitioning $c \in \mathcal{C}$, by which the graph is partitioned into $Q_a$ clusters with 
$L_q \ge 1$. As $L = Q$ (see Section II-A1), there remain $Q_b = Q - Q_a$ clusters with $L_q = 0$. In what 
follows, we show that partitioning $c$ is suboptimal, by constructing another partitioning whose 
cost is less than that of $c$. 

Index the $Q_a$ and $Q_b$ clusters of partitioning $c$ by $1, \dots, Q_a$ and $Q_a + 1, \dots, Q$, respectively. Moreover, 
let $T_a$ and $T_b$ correspondingly denote the aggregate sum weight of edges inside all clusters with 
and without cellular users. Thus we have 

$$T_a = \sum_{q=1}^{Q_a} \sum_{l \in \mathcal{L}_q} \left( \sum_{j \in \mathcal{L}_q, j \ne l} w_{lj} + \sum_{k \in \mathcal{K}_q} w_{kl} \right) \tag{37}$$

and $T_b = 0$ by the definition of the estimated network graph. Let $T_c$ denote the total cost of partitioning $c$. In order to establish 
that partitioning $c$ is suboptimal, we show that 

$$T_c = T_a + T_b > \min \sum_{q=1}^{Q_a} \sum_{l \in \mathcal{L}_q} \sum_{k \in \mathcal{K}_q} w_{kl}. \tag{38}$$

To this end, we construct some partitioning $c'$ with $T_{c'} < T_c$. Assume that we change only one 
cluster of $c$, say cluster $r \in \{1, \dots, Q_a\}$ with $L_r > 1$, by removing a cellular user $J \in \mathcal{L}_r$. Since 
all vertices must be included in the partitioning, $J$ is added to some cluster $r' \in \{1, \dots, Q\} - \{r\}$. 
Therefore, one of the following holds: 

• $r' \in \{1, \dots, Q_a\} - \{r\}$, or 

• $r' \in \{Q_a + 1, \dots, Q\}$. 

It is clear that the first case results in the original problem. Hence, we assume that the cellular 
user $J$ is included in $r' \in \{Q_a + 1, \dots, Q\}$, and refer to the new partitioning by $c'$. Then we have 

$$T_{c'} = T_c - \sum_{j \in \mathcal{L}_r} w_{jJ} - \sum_{k \in \mathcal{K}_r} w_{kJ} + \sum_{k \in \mathcal{K}_{r'}} w_{kJ}. \tag{39}$$

Since $0 < w_{kJ} \le p_d^{(M)}$, we have $0 < \sum_{k \in \mathcal{K}_x} w_{kJ} \le K p_d^{(M)}$ for any cluster $x$. Moreover, as 
$L_r > 1$ and $w_{jJ} = C$ for $j, J \in \mathcal{L}$, then $\sum_{j \in \mathcal{L}_r} w_{jJ} \ge C$ by the definition of the estimated network graph. Hence the 
worst case occurs when: i) $\sum_{k \in \mathcal{K}_r} w_{kJ} \approx 0$, which means that in cluster $r$, no D2D user causes 
interference to the cellular user $J$, ii) $\sum_{k \in \mathcal{K}_{r'}} w_{kJ} = K p_d^{(M)}$, that is, cluster $r'$ includes all D2D 
users that cause the maximum interference to the cellular user $J$, and iii) $\sum_{j \in \mathcal{L}_r} w_{jJ} = C$, i.e., 
$L_r = 2$. As a result, 

$$T_{c'} \le T_a - C + K p_d^{(M)} < T_a, \tag{40}$$

as we assume $C > K p_d^{(M)}$ by the definition of the estimated network graph. Therefore by (40) partitioning $c$ is suboptimal, which 
is the contraposition, and hence the proof is complete. 


C. Proof of Proposition 3 

By Proposition 2, any optimal partitioning of the estimated network graph $G_E$ includes exactly 
one cellular user in each cluster; therefore we can assume that $w_{ij} = 0$, $\forall i, j \in \mathcal{L}$. Moreover, by 
the definition of the estimated network graph, $w_{ij} = 0$ $\forall i, j \in \mathcal{K}$. Therefore we define a complete bipartite graph $G$ with $V_1 = \mathcal{K}$ 
and $V_2 = \mathcal{L}$. The weight of the edge connecting $k \in \mathcal{K}$ and $l \in \mathcal{L}$ is equal to the corresponding 
edge in $G_E$, i.e., $w_{kl}$. We then augment $V_2$ by replicating each node $l \in \mathcal{L}$ $K$ times, resulting 
in a set $\mathcal{L}' = \mathcal{L} \cup \mathcal{L} \cup \dots \cup \mathcal{L}$ ($K$ times). Using this set, a bipartite graph $G'$ is constructed, where $V_1 = \mathcal{K}$ 
and $V_2 = \mathcal{L}'$. The weight of an edge connecting any $k \in \mathcal{K}$ to every copy $l' \in \mathcal{L}'$ of some $l$ 
is $w_{kl'} = w_{kl}$. On graph $G'$, a bipartite minimum-weighted matching results in a $K \times (K \times L)$ 
assignment matrix $\mathbf{B} = [b_{kl'}]$, so that the sum 

$$\sum_{k \in \mathcal{K}} \sum_{l' \in \mathcal{L}'} w_{kl'}\, b_{kl'} \tag{41}$$

is minimized. For each $l$, let the set of its copies be denoted by $\mathcal{U}_l$. Moreover, the set of all users 
$k \in \mathcal{K}$ that are assigned to any copy of $l$ is denoted by $\mathcal{A}_l$. Thus (41) can be reformulated as 

$$\sum_{l=1}^{L} \sum_{j \in \mathcal{U}_l} \sum_{i \in \mathcal{A}_l} w_{ij}\, b_{ij}, \tag{42}$$

which is identical to (21). Hence the proposition follows. 
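
The replication argument can be illustrated on a toy instance. In the sketch below, each cellular user is copied $K$ times and the minimum-weight matching on the replicated graph is found by brute force, which stands in for a polynomial-time matching algorithm and is only practical for very small inputs. The function name and the weights in the usage note are hypothetical.

```python
import itertools


def replicated_matching(w):
    """Minimum-weight matching on the bipartite graph with K copies of each cellular user.

    w[k][l] is the weight between D2D user k and cellular user l.
    Returns (clusters, cost), where clusters[l] lists the D2D users matched to a copy of l.
    """
    num_d2d, num_cell = len(w), len(w[0])
    # Column c of the replicated graph is a copy of cellular user copies[c].
    copies = [l for l in range(num_cell) for _ in range(num_d2d)]
    best_cost, best_match = float("inf"), None
    # A matching gives every D2D user its own copy, i.e. an injective map into the copies.
    for match in itertools.permutations(range(len(copies)), num_d2d):
        cost = sum(w[k][copies[c]] for k, c in enumerate(match))
        if cost < best_cost:
            best_cost, best_match = cost, match
    clusters = {l: [] for l in range(num_cell)}
    for k, c in enumerate(best_match):
        clusters[copies[c]].append(k)
    return clusters, best_cost
```

For example, with `w = [[0.3, 0.1], [0.2, 0.4], [0.6, 0.5]]` the matching groups D2D users 0 and 2 with cellular user 1 and D2D user 1 with cellular user 0. Because each cellular user has as many copies as there are D2D users, several D2D users can share one cellular user, which is what turns the matching into a partitioning.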

D. Proof of Theorem 1 

1) Some Auxiliary Definitions and Results: The proof is based on some auxiliary definitions 
and results that are briefly stated in the following. 

In what follows, $v$ stands for a function defined on a discrete set $\mathcal{X}$, where $\mathcal{X} = \prod_{j \in \mathcal{I}} \mathcal{X}_j$, 
$\mathcal{X}_j = \{x_j \in \mathbb{Z} : \underline{x}_j \le x_j \le \bar{x}_j\} \subset \mathbb{Z}$, and $\underline{x}_j, \bar{x}_j \in \mathbb{Z}$. Moreover, $\|\mathbf{x}\| = \sum_{j \in \mathcal{I}} |x_j|$ denotes the $l_1$-norm 
of a vector $\mathbf{x} \in \mathbb{Z}^{|\mathcal{I}|}$. 

Definition 10 (Larger Midpoint Property (LMP)). We say that a function $f : \mathcal{X} \to \mathbb{R}$ satisfies 
the larger midpoint property (LMP) if for any $\mathbf{x}, \mathbf{y} \in \mathcal{X}$ with $\|\mathbf{x} - \mathbf{y}\| = 2$, 

$$\max_{\mathbf{z} \in \mathcal{X} : \|\mathbf{x} - \mathbf{z}\| = \|\mathbf{y} - \mathbf{z}\| = 1} f(\mathbf{z}) \ge t f(\mathbf{x}) + (1 - t) f(\mathbf{y}) \quad (\exists t \in (0, 1)), \tag{43}$$

or 

$$\max_{\mathbf{z} \in \mathcal{X} : \|\mathbf{x} - \mathbf{z}\| = \|\mathbf{y} - \mathbf{z}\| = 1} f(\mathbf{z}) \; \begin{cases} > \min\{f(\mathbf{x}), f(\mathbf{y})\} & \text{if } f(\mathbf{x}) \ne f(\mathbf{y}), \\ \ge f(\mathbf{x}) = f(\mathbf{y}) & \text{o.w.} \end{cases} \tag{44}$$

Definition 11 (Separable Concave Function). A function $v : \mathcal{X} \to \mathbb{R}$ is separable concave if it 
can be written in the form $v(\mathbf{x}) = \sum_{i \in \mathcal{I}} v_i(x_i)$, where $v_i(x_i) \ge \frac{v_i(x_i + 1) + v_i(x_i - 1)}{2}$. 

Lemma 1 ([31]). If $v : \mathcal{X} \to \mathbb{R}$ is a separable concave function, then (43) holds, and therefore 
$v$ satisfies the larger midpoint property. 

Proposition 5 ([31]). Let $\mathcal{G}$ be an exact potential game with a potential function $v$ that satisfies 
the LMP property. Then $\mathbf{i} \in \mathcal{X}$ maximizes $v$ if and only if it is a Nash equilibrium. 

2) Proof of Theorem 1: The proof consists of two parts. First we show that the power 
allocation game defined in Definition 7 is an exact potential game by deriving a potential function. 
This will prove the first part of Theorem 1. Afterwards we establish that the potential function 
satisfies the LMP property, and we characterize the set of Nash equilibria using Proposition 5. 
This will prove the second part of the theorem. 

Part One 

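By Definition 5, we need to find a function $v : \mathcal{X} \to \mathbb{R}_+$ that satisfies (26). With $R_k(\mathbf{i})$ given by 
the utility function of the system model, we have 

$$R_k(p_k, \mathbf{p}_{-k}) - R_k(p'_k, \mathbf{p}_{-k}) = \log\left( \frac{p_k}{p'_k} \right) - c\,(p_k - p'_k). \tag{45}$$

Define 

$$v(\mathbf{p}_{d,q}) = \sum_{k \in \mathcal{K}_q} \log(p_k) - \sum_{k \in \mathcal{K}_q} c\, p_k. \tag{46}$$

Then by simple calculus it follows that 

$$v(p_k, \mathbf{p}_{-k}) - v(p'_k, \mathbf{p}_{-k}) = \log\left( \frac{p_k}{p'_k} \right) - c\,(p_k - p'_k). \tag{47}$$

Therefore, according to Definition 5 and by comparing (47) with (45), it can be concluded that 
the power allocation game is an exact potential game with the potential function defined in (46). 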

Part Two 


Lemma 2. The potential function of the cluster power allocation game (given by (46)) is 
separable concave. 

Proof: Clearly, the potential function can be written as $v(\mathbf{p}_{d,q}) = \sum_{k \in \mathcal{K}_q} v_k(p_k)$ with 

$$v_k(p_k) = \log(p_k) - c\, p_k. \tag{48}$$

Thus, by the assumption $p_k > 1$ (see Section II-A1), we have 

$$\frac{v_k(p_k + 1) + v_k(p_k - 1)}{2} = \frac{\log(p_k^2 - 1) - 2 c\, p_k}{2} < \frac{\log(p_k^2) - 2 c\, p_k}{2} = \log(p_k) - c\, p_k. \tag{49}$$

Therefore, by Definition 11, the function is separable concave. ■ 
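
The midpoint inequality used in this proof is easy to confirm numerically. The sketch below evaluates the per-player term $\log(p) - c\,p$ on an integer grid with $p > 1$ and checks that the average of its two neighbours never exceeds its value at $p$; the grid range and the helper names are chosen only for illustration.

```python
import math


def v_k(p, c):
    """Per-player term of the potential: log(p) - c * p."""
    return math.log(p) - c * p


def is_discretely_concave(grid, c):
    """Check v_k(p + 1) + v_k(p - 1) <= 2 * v_k(p) for every integer p in the grid (p > 1)."""
    return all(v_k(p + 1, c) + v_k(p - 1, c) <= 2 * v_k(p, c) for p in grid)


print(is_discretely_concave(range(2, 50), c=0.1))
```

The linear cost term cancels in the comparison, so the inequality reduces to $\log(p^2 - 1) < \log(p^2)$ and holds for any price factor $c$.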


Lemma 3. The potential function of the cluster power allocation game (given by (46)) satisfies 
the larger midpoint property. 

Proof: The proof directly follows from Lemma 1 and Lemma 2. ■ 

Therefore, since the potential function satisfies the LMP property, the second part of Theorem 1 
follows directly from Proposition 5. 


E. Proof of Proposition 4 

By Definition 9, $1 \le \mathrm{PoS}$. Hence we only need to show that $\mathrm{PoS} \le \frac{\log(p_d^{(M)})}{\log(\gamma_{\min})}$. To this end, we 
need the following theorem. 

Theorem 3 ([52]). Let $\mathcal{G} = \{\mathcal{K}, \mathcal{X}, \{R_k\}_{k \in \mathcal{K}}\}$ be a potential game with some potential function 
$v(\mathbf{i})$. Also, let $f(\mathbf{i}) = \sum_{k=1}^{K} R_k(\mathbf{i})$. Assume that for any joint action profile $\mathbf{i}$, 

$$\frac{1}{\alpha} f(\mathbf{i}) \le v(\mathbf{i}) \le \beta f(\mathbf{i}), \tag{50}$$

for some positive constants $\alpha$ and $\beta$. Then PoS is at most $\alpha\beta$. 

For the cluster power allocation game, we have $\mathbf{i} := \mathbf{p}_{d,q}$, and $v(\mathbf{p}_{d,q})$ is given by (27). Also, 
by the definition of the utility function, we have $f(\mathbf{p}_{d,q}) = \sum_{k \in \mathcal{K}_q} \log(\gamma_k) - c\, p_k$. 
Besides, as $0 < f_{uv,q} \le 1$ and $0 < g_{uv} \le 1$ (see Section II-A1), at each trial, for any selected 
transmit power $p_k \in \mathcal{M}$ and any player $k \in \mathcal{K}$, we have $\gamma_{\min} \le \gamma_k \le p_k$. Therefore, for any 
$\mathbf{p}_{d,q}$, 

$$\frac{v(\mathbf{p}_{d,q})}{f(\mathbf{p}_{d,q})} \ge 1. \tag{51}$$

On the other hand, 

$$\frac{v(\mathbf{p}_{d,q})}{f(\mathbf{p}_{d,q})} = \frac{\sum_{k \in \mathcal{K}_q} \log(p_k) - \sum_{k \in \mathcal{K}_q} c\, p_k}{\sum_{k \in \mathcal{K}_q} \log(\gamma_k) - \sum_{k \in \mathcal{K}_q} c\, p_k} \le \frac{\sum_{k \in \mathcal{K}_q} \log(p_k)}{\sum_{k \in \mathcal{K}_q} \log(\gamma_k)} \le \frac{\log(p_d^{(M)})}{\log(\gamma_{\min})},$$

where the first inequality is concluded from (51). Thus, by Theorem 3, the result follows. 